Neutrophil degranulation, NETosis and platelet degranulation pathway genes are co-induced in whole blood up to six months before tuberculosis diagnosis

Mycobacterium tuberculosis (M.tb) causes tuberculosis (TB) and remains one of the leading causes of mortality due to an infectious pathogen. Host immune responses have been implicated in driving the progression from infection to severe lung disease. We analyzed longitudinal RNA sequencing (RNAseq) data from the whole blood of 74 TB progressors whose samples were grouped into four six-month intervals preceding diagnosis (the GC6-74 study). We additionally analyzed RNAseq data from an independent cohort of 90 TB patients with positron emission tomography-computed tomography (PET-CT) scan results which were used to categorize them into groups with high and low levels of lung damage (the Catalysis TB Biomarker study). These groups were compared to non-TB controls to obtain a complete whole blood transcriptional profile for individuals spanning from early stages of M.tb infection to TB diagnosis. The results revealed a steady increase in the number of genes that were differentially expressed in progressors at time points closer to diagnosis with 278 genes at 13–18 months, 742 at 7–12 months and 5,131 detected 1–6 months before diagnosis and 9,205 detected in TB patients. A total of 2,144 differentially expressed genes were detected when comparing TB patients with high and low levels of lung damage. There was a large overlap in the genes upregulated in progressors 1–6 months before diagnosis (86%) with those in TB patients. A comprehensive pathway analysis revealed a potent activation of neutrophil and platelet mediated defenses including neutrophil and platelet degranulation, and NET formation at both time points. These pathways were also enriched in TB patients with high levels of lung damage compared to those with low. These findings suggest that neutrophils and platelets play a critical role in TB pathogenesis, and provide details of the timing of specific effector mechanisms that may contribute to TB lung pathology.


Introduction
Mycobacterium tuberculosis (M.tb), which causes tuberculosis (TB), remains one of the leading pathogens that is responsible for human death [1]. Although estimates suggest that 23% of the world's population are infected with M.tb [1], most individuals are able to eradicate or control the disease [2] and only 5-10% develop TB during their lifetime [1].
It is largely unknown why some M.tb-infected individuals progress to active TB, but an over-active inflammatory response is considered an important factor contributing to lung pathology [3,4]. Rapid necrosis, associated with a delayed-type hypersensitivity reaction against accumulated M.tb antigens [5], or an M.tb-mediated autoreactive response [4], is thought to cause an inflammatory response that drives lung extracellular matrix (ECM) destruction and cavity formation. This allows M.tb in the lung interstitium to access the airways and be transmitted [6].
Neutrophils are strongly activated in response to M.tb infection [7,8], although numerous studies have shown they are ineffective at killing or controlling M.tb replication (for review see [9]). Rather, the activation and infiltration of neutrophils at late stages of infection is associated with TB pathogenesis [9][10][11][12][13]. Indeed, neutrophils are the predominant immune cell type present in lung lesions and cavities of pulmonary TB patients and are associated with lung ECM destruction and cavity formation [12,14]. Neutrophils contain a diverse array of preformed proteases in their granules including neutrophil collagenase and matrix metallopeptidase 8 (MMP8), which digest ECM in human lung [12] and MMP8 is the most prevalent MMP present in the sputum of TB patients [15]. M.tb-induced MMP8 secretion is also associated with the secretion of neutrophil extracellular traps (NETs); a process by which neutrophils release their antimicrobial granule proteases, DNA and histones extracellularly in a type of programmed cell death named NETosis [16]. Collectively, these studies strongly implicate neutrophils in late stages of TB lung pathology including ECM destruction and cavity formation.
Whole blood transcriptomic studies have advanced our knowledge of the host response to many diseases, including TB [11,[17][18][19]. Previous transcriptomics studies identified a number of immune processes activated in TB patients including interferon (IFN)-signaling [11,17,20], myeloid cell inflammation [21], and the inflammasome and proinflammatory pathways [17]. Two large cohort studies that monitored individuals at high risk for developing TB for up to two years before diagnosis generated whole blood transcriptional profiles for TB progressors with the aim to develop predictive transcript-based risk signatures [22,23]. Subsequently, data from one of these studies, the South African adolescent cohort (aged 12-18 years) [23], was analyzed to identify biological pathways and processes that are active during TB progression [24,25]. These studies found that type I IFN signaling and the complement cascade were the main pathways activated at early stages of TB progression as observed in TB patients [24,25] and changes in functionally uncharacterized neutrophil and platelet gene modules occurred at times closer to TB diagnosis [25]. Here we extended the prior analyses by jointly analyzing two data sets and dissecting the neutrophil processes.
We performed a differential expression (DE) and comprehensive pathway analysis of whole-blood RNA sequencing (RNAseq) datasets obtained from the Gene Expression Omnibus (GEO) [26]. The first set, GEO series GSE94438 (GC6-74 study), was generated from individuals for a period of up to two years prior to TB diagnosis [22]. The second data set, GEO series GSE89403 (Catalysis TB Biomarker Study, hereafter the Catalysis study), compared TB patients to healthy controls. Collectively, the data sets provide a full spectrum of host transcriptional responses spanning from early infection stages to TB diagnosis. Our analysis focused on identifying host responses that may contribute to the development of TB.

Study groups and data quality
We performed a DE and comprehensive pathway analysis of whole-blood RNAseq data sets obtained from the GEO.
The first data set included RNAseq data for TB progressors (GEO:GSE94438) and was generated as part of the Grand Challenges 6-74 (GC6-74) program. The program was a longitudinal study of household contacts of newly diagnosed, sputum smear-positive TB cases in a high TB-prevalence settings and included samples from four different African populations. A detailed description of the study groups has been published previously [22,23,27,28]. Contacts diagnosed with TB within 3 months of recruitment were considered to have prior disease and were excluded from the study, while those developing disease after this point were considered to have true incident TB and were included as cases. Due to the finite follow-up some of the subjects classified as non-progressor might have had pre-clinical TB. Such subjects in the "control" group would dilute the disease signals, possibly obscuring some insights, but would not alter the fundamental findings.
Since the aim of the present study was to identify genes that are DE at different stages of TB progression, the progressor samples were labelled according to the time the sample was collected before their TB diagnosis. As an example, if a participant was diagnosed with TB 8 months after enrollment and they had samples collected at enrollment (time 0) and 6 months, these samples would be classified as 8 and 2 months before diagnosis, respectively. The labelled samples were subsequently placed into groups that corresponded to 19-24, 13-18, 7-12 and 1-6 months before TB diagnosis (longitudinal data for subjects who were diagnosed with TB were aligned to the time of diagnosis and specimens taken before diagnosis were binned into the aforementioned groups). The number of samples in each study group, and mean time before TB diagnosis for each time point are listed in Table 1. For the GC6-74 study, all the TB progressor groups were compared to the same non-TB control samples that were collected at recruitment time. This ensured that any differences between the progressor groups were a result of changes in their respective expression rather than changes in the baseline group.
The second data set used included a subset of RNAseq data from the South African Catalysis study (GEO: GSE89403). Here, we only used RNAseq data from TB patients at diagnosis and the non-TB controls [27,28]. This data set also contained positron emission tomography-computed tomography (PET-CT) data for the TB patients which we used to categorise the patients (see Methods for details) into groups with high (high PET scores) and low (low PET scores) levels of lung damage. The primary analysis for the Catalysis study data set compared all TB patients to controls. We additionally performed a number of PET-score based sub-analyses where patients with high and low PET scores were individually compared to controls, and patients with high and low PET scores were directly compared to identify differences between them.
The FastQC analysis revealed that the read quality was good with no adapter contamination, therefore, no trimming was performed. Mapping statistics revealed that for the GC6-74 cohort, a mean of 47

Differential expression analysis
The edgeR DE analysis identified several significantly perturbed genes across the different time points (Table 2). For the progressors in the GC6-74 study [23], there was a steady increase in the number of DE genes detected at time points more proximal to TB diagnosis with only five genes at 19-24 months, 278 at 13-18 months, 742 at 7-12 months and 5,131 at 1-6 months before diagnosis. A total of 9,205 genes were DE in TB patients compared to controls (Catalysis study). In the Catalysis sub-group analysis, in comparison to controls, a similar number of genes were significantly DE in patients with high (9,211) and low (8,496) PET-scores (Table 2). Notably, a total of 2,144 significantly DE genes were identified when directly comparing patients with high and low PET-scores confirming differences in transcriptional signatures between TB patients with high and low levels of lung damage. The DE results for all genes and time points are presented in S2 Table. There was substantial overlap in the DE genes with 92% (445/492) of the genes upregulated 7-12 months before diagnosis also being upregulated in the samples taken 1-6 months prior to TB diagnosis. Similarly, 85% (2,365/2,791) of the genes upregulated 1-6 months before diagnosis in GC6-74 study subjects were also upregulated in Catalysis study TB patients ( Fig 1A). The high level of overlap in genes upregulated 1-6 months before diagnosis with those upregulated in TB patients illustrates substantial homogeneity in the host response between the two cohorts and provides support for the validity in their comparisons.
In the Catalysis study PET categorized sub-analysis, the TB high PET score group vs controls shared 91% (3,818/4,200) while the TB low PET score group vs controls shared 94% (3,957/4,200) of the upregulated genes identified in the complete analysis ( Fig 1B). Of the 995 genes that were upregulated when directly comparing the TB high PET score group to the TB low PET score group, 95% (943/995) were upregulated when comparing the TB high PET score group to controls and 78% (773/995) were upregulated when comparing the TB low PET score group to controls. Thus, 773 genes that were upregulated when comparing the TB low PET score group to controls, were further significantly upregulated in TB patients with high PET scores compared to those with low PET scores.
Since the increased sample size at time points more proximal to TB diagnosis could be responsible for the increased number of DE genes observed, we performed subsampling to confirm that true biological differences were responsible for the increased number of genes. The subsampling of the 1-6 months before diagnosis group revealed that on average 1,454 genes were upregulated across 1,000 iterations of 19 samples compared to 2,791 upregulated genes detected with a sample size of 47 (S1 Fig). For the TB group, a mean of 3,510 genes were upregulated across 1,000 iterations of 19 samples compared to 4,200 upregulated genes detected with a sample size of 90 (S2 Fig). The higher mean number of upregulated genes detected when subsampling both groups compared to the 13-18 (249; n = 19) and 7-12 (492; n = 18) months before diagnosis groups (Table 2) indicated that the increased number of DE genes observed was predominately a consequence of the time point. This is also supported by almost double the number of genes detected in the 7-12 compared to the 13-18 months before diagnosis groups which had similar sample sizes. The results also confirmed that larger sample numbers improve statistical power and facilitate the detection of additional DE genes.

Gene Ontology (GO) and KEGG pathway analysis
Since only five genes were DE 19-24 months before diagnosis (S2 Table), the pathway analysis was performed on the 13-18, 7-12 and 1-6 months before diagnosis time points and on TB patients. The use of the weighted algorithm (weight01) in topGO identified the most specific and thus informative GO terms including several not previously reported during TB progression. In addition, only pathways that had a minimum of 10 DE genes were considered. In summary, the pathway analysis of the upregulated genes identified several defense-related pathways that were enriched early (13-18 months before diagnosis), remained elevated and intensified, based on increased numbers of annotated genes, at time points more proximal to TB diagnosis. Several additional processes that were specifically related to neutrophil and platelet defense responses were first enriched 1-6 months before diagnosis and remained enriched in TB patients.
A list of the GO and KEGG pathways significantly perturbed at different stages of TB progression is provided in S2 and S3 Tables. Here, in the main text, we focus on the induction of neutrophil-mediated defenses and related pathways and processes since we consider these the most likely drivers of disease progression.

One to six months before diagnosis
A total of 5,131 genes were significantly DE 1-6 months before TB diagnosis with respect to household controls with 2,791 (54%) up-regulated ( Table 2). The marked increase in the number of DE genes was notably accompanied by a corresponding enrichment in several effector pathways that were not enriched at earlier time points documenting a strong and distinct activation of the host immune response at this time point. This response was dominated by a strong enrichment in up-regulated genes that function in neutrophil-mediated immunity with "neutrophil degranulation" [p<10 -30 , 222/464 (48%) annotated genes] being the most significant and specific GO biological process (BP) term detected ( Fig 2B). Other enriched pathways related to neutrophil function included "neutrophil chemotaxis", "Fc-gamma receptor signaling pathway involved in phagocytosis" and "platelet degranulation". Several terms related to the production of reactive oxygen species (ROS) were also enriched including "positive regulation of superoxide anion generation", "respiratory burst", "mitochondrial electron transport" and "cellular oxidant detoxification". The KEGG "NET formation" and "tuberculosis" disease pathways were also enriched at this time point (Fig 2B). An illustration of the KEGG "NET formation" pathway is presented in Fig 3A and 3B with fold-changes (log2) for significantly upregulated genes displayed for the different time points.

PLOS ONE
Neutrophil transcription activated prior to tuberculosis diagnosis Additional upregulated genes annotated to function in neutrophil degranulation include those encoding components of the superoxide generating NADPH oxidase 2 (NOX2). This included the transmembrane catalytic [cytochrome b-245 -alpha (CYBA) and -beta (CYBB)] and cytosolic regulatory subunits [neutrophil cytosolic factor (NCF)1, 2 and 4 and the small Gprotein, RAC2 [33,34]. The induction of genes encoding components of NOX2 along with the enrichment of pathways involved in superoxide generation and the respiratory burst, is consistent with the enrichment of FcγR mediated phagocytosis which activates NOX2 assembly [34,35]. NOX2 generated ROS also induce the synthesis of NETs [36,37].
The co-induction of "platelet degranulation" with "neutrophil degranulation", first observed 1-6 months before diagnosis, is consistent with the physical interactions and costimulatory roles platelets and neutrophils share during immune responses [38,39].

TB versus healthy controls
A total of 9,205 genes were DE in individuals with TB in contrast to the non-TB time zero baseline group from the GC6-74 study, of which 4,200 (46%) were up-regulated (Table 2).  Table 1). Fold-changes for Catalysis TB-patients, categorized with high and low PET scores, were also compared individually to controls and directly to each other to identify differences between the categorized groups. https://doi.org/10.1371/journal.pone.0278295.g004 Similar to 1-6 months before diagnosis, there was a strong enrichment in pathways related to neutrophil-mediated immunity with "neutrophil degranulation" [p<1x10 -30 , 305/464 (66%) annotated genes] being the most significant and specific GO BP term detected (Fig 2A). Approximately 95% (210/222) of the neutrophil degranulation ( Fig 5A) and 97% (60/62) of NET formation (Fig 5B). annotated genes that were upregulated 1-6 months before diagnosis were also upregulated in TB patients. All the other enriched neutrophil related functions observed 1-6 months before diagnosis were also enriched in TB patients.

Catalysis PET-CT sub-analysis
Unsurprisingly, given the high level of overlap between the genes, the results of the pathway analysis were very similar to those observed when comparing the complete TB group to controls, with "neutrophil degranulation", "platelet degranulation" and "NET formation" being significantly enriched in both the TB-high-PET score group and the TB-low-PET score group  Table 3)], revealing that TB patients with increased lung damage have an enrichment in upregulated genes that function in these pathways. Annotated neutrophil degranulation genes that were significantly upregulated in TB patients with high PET scores compared to those with low PET scores included MMP8, S100A12, S100A8, S100A9, CAMP and PPBP (Fig 4).

Discussion
Our results revealed an early induction of IFN-related signaling at 18 months before diagnosis. A strong induction of neutrophil and platelet degranulation and NETosis related genes was detected 6 months before TB diagnosis and persisted in TB patients supporting a pathogenic role of these responses in disease development. Although these processes have previously been detected in patients diagnosed with active TB, here we document that they occur well before diagnosis, indicating that they may be critical in mediating lung tissue destruction and progression from infection with M.tb to active TB.  DE and pathway enrichment analysis revealed that an induction of the host defense response, that is dominated by type I and II IFN signaling, was first detectable in whole blood of TB progressors as early as 13-18 months before diagnosis. A progressive induction in the intensity of the IFN response was observed as time points approached TB diagnosis with a distinct induction of neutrophil-mediated defense observed 1-6 months before diagnosis that persisted and intensified in patients diagnosed with TB. Given that we examined transcriptional responses in whole blood, changes observed are likely to result from signaling molecules released from infected cells, most likely in the lungs, rather than through direct contact with the pathogen. The signaling is likely to prime circulating leukocytes for activation in preparation for their recruitment to infection sites.
A unique and striking finding was the strong induction of specific neutrophil-mediated defenses, including neutrophil degranulation, and NET formation, that are first observed 1-6 months before diagnosis (Fig 2B), maintained in TB patients and significantly enriched in TB patients with high levels of lung damage compared to those with low level damage assessed by PET-CT (S4 Table). The induction of neutrophil-mediated defenses is consistent with literature that associates neutrophil activation and infiltration with TB as well as pulmonary destruction [9,11,13]. Neutrophils have been reported to be the predominant immune cell type present in the sputum and at the site of infection in the lung [40], and are also the main cell type infected with M.tb in sputum and lung cavities [14]. In addition, neutrophil markers are associated with necrotic areas in granulomas [41] and excessive neutrophil infiltration is associated with the softening of caseous lesions in the lung [5].
Although the neutrophil degranulation and NET formation processes are both enriched 1-6 months before diagnosis and in TB patients, the additional increase in the expression level and induction of additional genes that function in these pathways in TB patients suggests a sequential induction of neutrophil activation during the late stages of disease progression. Genes induced 1-6 months before diagnosis encode neutrophil membrane integrins and chemotactic receptors that mediate their adhesion and transmigration to infection sites as well as components of NOX2 [42]. These genes are generally involved in the priming of neutrophils in preparation for full activation at infection sites. In TB patients, there is a distinct induction of genes that encode neutrophil granule proteins that function in ECM degradation including MMP8 and MMP9 and those that encode numerous neutrophil azurophilic granule proteins including MPO, ELANE, DEFA4 and BPI. The MPO, ELANE and PADI4 peptides play a critical role in NET formation, driving chromatin decondensation, cell rupture and the extracellular release of DNA [43][44][45]. The significantly higher induction of genes in patients with high-PET scores compared to those with low-PET scores, including MMP8, links their increased expression to increased lung damage.
The induction of genes that encode neutrophil granule proteins, including bactericidal granule enzymes, is intriguing since expression of these genes is typically high in neutrophils during maturation when granule proteins are synthesized, and declines in mature cells [46,47]. A highly similar set of genes including AZU1, CAMP, CTSG, DEFA4, ELANE, LTF, and MPO is activated and forms part of a co-expression module in isolated low-density granulocytes (LDG) from systemic lupus erythematosus (SLE) patients and it is thought that LDGs are immature neutrophils that have been released into the circulation during granulopoiesis [45,48]. Elevated LDG levels are correlated with disease severity in TB patients, but in vitro studies indicated LDGs were generated from normal neutrophils after degranulation or NET formation [49,50]. These studies, however, did not investigate whether the expression of genes encoding the granule proteins were elevated in these cells. Given our analysis is on whole blood which includes multiple cell types, it is possible that the observed elevated expression of neutrophil related genes is a result of an increase in the proportion of neutrophils in the blood. This, however, could only partially account for the observed increases since the expression of a number of these genes including ELANE, DEFA4, BPI and LTF (Fig 4) are elevated > 4-fold (log2 FC > 2). Since neutrophils already constitute 50% to 70% of all circulating leukocytes, an increase in their abundance could not possibly account for the increased transcription observed. Nevertheless, the increased expression of neutrophil granule genes observed in this study may contribute to tissue destruction and TB pathogenesis since pro-inflammatory LDGs have an enhanced capacity to secrete NETs and granule peptides in TB and SLE patients [45,50].
While M.tb has been shown to stimulate NET synthesis [51], NETs have a limited ability to kill M.tb [52] and it has been suggested they rather provide a platform for extracellular M.tb replication that facilitates pulmonary lesion growth and drives the transition to infectious TB [9,53,54]. In TB patients' plasma NET, MPO and ELANE levels are correlated with TB severity [55], while elevated serum levels of citrunillated histone H3, a NET biomarker, are associated with lung cavitation and poor treatment outcome [56]. A number of NET components, including dsDNA, mitochondrial DNA, and granule proteinases function as immune-stimulatory molecules when released extracellularly [45] and have been identified as important drivers of immune-pathogenesis in both infectious and non-infectious human diseases [57]. [61,62] Given the early and sustained level of type I IFN signaling observed during TB progression, it is interesting to note that in TB susceptible mice, type I IFN signaling has been shown to induce NETosis through activation of interferon α and β receptor subunit 1, which is associated with enhanced mycobacterial growth at infection sites and enhanced TB pathogenesis [58]. The same study identified NETs in nectrotic lung lesions of TB patients that responded poorly to treatment. Further, serum from patients with autoimmune disorders that have elevated levels of type I IFNs, as well as exogenous IFN-α, has been shown to stimulate neutrophil NET production in vitro suggesting that type I IFN prime neutrophils for NET production [59][60][61]. In turn, self-DNA and antimicrobial peptides released with NETs, have been reported to induce the chronic activation of plasmacytoid dendritic cells and secretion of type I IFNs in SLE patients creating a positive feedback loop that prolongs the inflammatory response [62]. The upregulated type I IFN signaling observed in this study that precedes neutrophil activation and NETosis, along with the above-mentioned studies, is consistent with type I IFN signaling diving neutrophil activation and NETosis during TB progression and enhancing TB pathogenesis.
The co-induction of platelet and neutrophil-mediated defense responses starting as early as 13-18 months before diagnosis is consistent with their established dependent roles during immune responses [38,39,63]. The activation of platelet and neutrophil degranulation, including the induction of platelet and neutrophil derived granule chemokines and membrane proteins that mediate their co-migration and physical adhesion including P-selectin (SELP) and CXCL7 (neutrophil-activating peptide-2 NAP2) from platelets and selectin P ligand (SELPLG), CXCR1 and 2, and ITGB2/LFA1 from neutrophils is consistent with studies that document their physical interactions during defense responses [38,63]. Platelet-neutrophil adhesion induces intracellular signaling cascades that activate many neutrophil antimicrobial functions observed in this study including ROS production, phagocytosis and NETosis [39,64]. Indeed, the neutrophil membrane integrin ITGB2/LFA1 is required to mediate neutrophil-platelet adhesion that drives NET release in human sepsis [64]. It has been suggested that platelets function as a barometer, stimulating NET synthesis when bacterial levels exceed the neutrophils' capacity to control infection through alternative mechanisms [64,65].
Platelet-neutrophil complexes are implicated in the pathogenesis of pulmonary inflammation and acute lung injury [39]. In TB patients, platelet numbers and activity are increased [66,67] and the concentration of numerous platelet-derived mediators including P-selectin, RANTES and PDGF was increased and correlated with levels of tissue-degrading MMPs 1, 7, 8, and 9, in bronchoalveolar lavage samples [68]. The co-induction of platelet and neutrophil functions that are first observed 1-6 months before diagnosis and further activated in TB patients suggests a pathological role for these interactions in the late stages of TB development. This is further supported by the increased induction of these pathways in TB patients with high PET scores compared to those with low PET scores.
The current findings confirm and elaborate the findings of a previous study by Scriba et al. [25] that identified the induction of uncharacterized neutrophil and platelet gene modules 6 months before TB diagnosis. Here, we identified specific processes that are mediated by these cells, including neutrophil and platelet degranulation as well as NET formation that are activated 6 months before TB diagnosis. Our analyses additionally discovered that these processes are further activated in individuals with TB and at an elevated level in TB patients with increased lung damage. Collectively, these results support that the activation of these pathways is linked to TB progression and increased lung damage in TB patients.

Conclusion
The distinct co-induction of neutrophil and platelet degranulation as early as 13-18 months before diagnosis, NET formation 1-6 months before diagnosis, as well as their further activation in TB patients is consistent with these processes playing a critical role in the late stages of disease progression. This is further supported by the enrichment in upregulated genes that function in these pathways in TB patients with increased levels of lung damage. Platelet-neutrophil interactions are required for mediating their chemotaxis and recruitment to infection sites and for NET synthesis [69]. NETs are associated with TB pathogenesis and are thought to provide an extracellular platform for M.tb growth [53] while neutrophil granule enzymes degrade the ECM of the lung [12]. Collectively, these responses can lead to rapid lesion growth and tissue destruction that allows M.tb to disseminate into the airways [53]. The detectable activation of these specific processes in whole blood around 6 months before TB diagnosis makes them promising candidates for targeted therapeutic interventions that may limit lung damage and prevent progression to active TB.

Study outline and data sources
The whole blood RNAseq data analyzed was obtained from two independent data sets.
The data for TB progressors were generated as part of the the Bill and Melinda Gates Foundation GC6-74 program that was a longitudinal study of household members of newly diagnosed TB cases which was conducted across four African sites including South Africa, The Gambia, Uganda and Ethiopia. In brief, when a newly diagnosed TB case was identified, individuals with whom they shared a house for a minimum period of three months were recruited with the expectation that they would have been exposed (likely repeatedly) and infected with M.tb and were, therefore, at high risk of developing TB. A total of 4,466 household contacts were followed for two years and whole blood was collected at recruitment (time zero), 6 and 18 months thereafter for RNAseq. Further details of the study details have been described previously [22,23]. The samples have been used in several previous publications for validation and discovery of TB biomarkers [22,23]. The raw RNAseq FASTQ files for the study were downloaded from the GEO public database (accession number GSE94438).
The second data set used included a subset of RNAseq data from the South African Catalysis study (GEO: GSE89403) which was a longitudinal study of resolution of lung inflammation in TB cases. Here, we only used RNAseq data from TB patients at diagnosis and the non-TB controls [27,28] to enlarge the size of the sample of TB at diagnosis patients. The study recruited adult HIV-negative TB patients, whose diagnosis was confirmed with a sputum culture [27,28]. Asymptomatic individuals that were recruited from the same community and tested negative for TB on sputum and chest X-ray were used as controls. The results of 18-F fluorodeoxyglucose (FDG) PET-CT scans from TB patients at enrollment were used to categorize patients into groups with high and low levels of lung inflammation activity based on PET. The metric used is a sum of total glycolytic activity index (TGAI) [27,70] of all metabolically active lesions and the products of the mean lesion intensities with the cavity volumes, abbreviated ComTGAI, and is correlated with levels of lung inflammation [27,71] Patients were dichotomized into low and high PET scores based on a threshold of 4,000 units at baseline.

Ethics statement
Both source studies (the GC6-74 study and the Catalysis TB Biomarker study) were performed with ethical approvals and required written informed consent. For the GC6-74 study the following ethics approvals applied (as described in [23]

RNAseq and quality control
The data used in the current study had been obtained with RNA extracted from whole blood and sequenced on the Illumina HiSeq-4000 (GC6-74) and HiSeq-2000 (Catalysis) platforms generating 50 bp stranded paired-end reads. We used the FastQC program (version 0.11.5) [72] to assess the quality of the reads.

Read mapping
The Spliced Transcripts Alignment to a Reference (STAR) software (version STAR_2.5.3a) [73] was used to map reads to the Ensembl [74] human genome primary assembly (version GRCh38.89). The quantMode GeneCounts option was selected to generate raw genewise read counts for each sample.

Differential expression analysis
The DE analysis was performed in R [75] using the edgeR (version 3.26.8) [76] Bioconductor [77] package. Briefly, raw counts were filtered to remove genes with low expression, normalized, and negative binomial generalized linear models were fitted. In addition to time before diagnosis, sex and site were included as factors in the model matrix since they were observed to be responsible for the separation of principal component (PC)1 and PC2 based on multidimensional scaling plots.
The linear model applied was: In R syntax as: model.matrix(~0 +TimeBf+ sex + site), i.e., a model without an intercept The quasi-likelihood F-test (QLF-Test) was used to determine significance and identify DE genes. For the GC6-74 data set, all TB progressor groups were compared to the non-TB baseline group (time 0). For the Catalysis data set, the TB patients were compared to non-TB controls. TB patients categorized with high and low PET scores were also separately compared to controls and directly to each other to identify differences. Significantly DE genes were selected that had a false discovery rate (FDR) <0.05.
To determine the influence that the larger sample sizes [1-6 months before (n = 47) and TB (n = 90) compared to 13-18 months before (n = 18) and 7-12 months before (n = 19), see Table 1] had on the number of DE genes detected, the 1-6 months before diagnosis and TB groups with larger samples were subsampled. These groups were subsampled 1000 times at a sample size of 19 and a DE analysis was performed for each iteration.

Gene ontology and KEGG pathway analysis
The topGO [78] Bioconductor package was used to test for enrichments in any GO terms [79] associated with the DE genes. The GO graph structure was generated using both the "classic" and "weight01", algorithms using the Fisher's exact test to identify enriched terms. In brief, the classic algorithm tests each GO category independently. The "weight01" is the default algorithm used by the topGO package and essentially penalizes scores for more general terms that share genes with more specific neighboring terms, weighting the analysis for the identification of more specific and therefore informative ontologies. "Weight01" is a mixture of two algorithms, "elim" which eliminates genes shared between a node and its ancestor, and "weight" which weights genes locally as a function of the ratio between child and ancestor nodes [78]. The weight01 mixture of the two algorithms moderates the extremes of the two individual algorithms.
The kegga function in edgeR was used to perform a KEGG [80] pathway enrichment analysis on the DE gene sets. The Bioconductor [77] pathview package [29] was subsequently used to visualize gene expression on enriched KEGG pathway graphs.
Due to the extensive overlap of genes in the hierarchical GO categories, neither the topGO or kegga programs correct for multiple hypothesis testing and recommend against it. We therefore applied a stringent p-value cut-off of <5x10 -4 to identify enriched pathways.
Supporting information S1 Table. PET scores for the catalysis study specimens. The PET score (low-PET, high-PET as defined in Methods) for the Catalysis study specimens used in the current study. (XLSX) S2 Table. List of differentially expressed genes at all time points. The statistical metrics presented for each comparison include: log2 fold change (log2FC), average log2 counts per million (logCPM), quasi-likelihood F-statistic (F), p-value (PValue) and false discovery rate (FDR). The non-TB time zero baseline group from the GC6 study were the control group for all comparisons prior to TB diagnosis (i.e., 24, 18, 12 and 6 months before diagnosis) while the Catalysis study healthy controls served as the baseline for the TB comparison. (XLSX) S3 Table. Gene ontology (GO) enrichment analysis results for significantly upregulated genes at all time points. The analysis was performed using the topGO R Bioconductor package. The statistical metrics presented for each process include: the total number of genes annotated to the process (Total annotated), the number of genes that were significantly up (N up) -regulated, the percent of total annotated that were up-regulated (% Up) and the number that were expected by chance (Expected). The uncorrected Fisher's exact test p-value (Pvalue) and overall rank for over-representation of the GO term in the set using both the classic and weight01 (W1) algorithms are presented. (XLSX) S4 Table. KEGG pathway enrichment analysis results for significantly differentially expressed genes at all time points. The analysis was performed using the kegga function from the edgeR R Bioconductor package. The statistical metrics presented for each pathway include: the total number of genes annotated to the pathway (Total annotated), the number of genes that were significantly up (N Up) or down (N Down) -regulated, the % of the total annotated that were up (% Up) or down (% Down) -regulated and the uncorrected Fisher's exact test pvalue (Pvalue) for up and down comparisons (P Up, P Down) for over-representation of the KEGG pathway in the set. (XLSX) S5 Table. KEGG and gene ontology (GO) enrichment analyses for significantly differentially expressed genes between low and high PET score groups for the terms and pathways identified in the main analyses. The analyses were performed using the topGO and kegga function from the edgeR R Bioconductor packages. The statistical metrics presented are as for S2 and S3 Tables respectively. (XLSX) S1 Fig. Density plot of subsampling the 1-6 months before TB diagnosis sample numbers. Subsampling was performed on the 1-6 month-before TB group to investigate the effect the larger sample size (n = 47) had on the number of significantly up-regulated genes identified. A total of 1000 differential expression analysis tests were performed using a sample size of 19 which is similar to the earlier 12 month before (n = 19) and 18-month before time points (n = 18). The mean number of significantly up-regulated genes identified in the subsampling was 1,436 (95% CI:120, 3,100). This number was lower than the 2,791 identified with the full 47 samples but far higher than the number identified in the 18 (249; n = 19) and 12 (492; n = 18) months before diagnosis groups. (TIF) S2 Fig. Density plot of subsampling the TB group sample numbers. Subsampling was performed on the Catalysis study TB group to investigate the effect the larger sample size (n = 90) had on the number of significantly up-regulated genes identified. A total of 1000 differential expression analysis tests were performed using a sample size of 19 which is similar to the earlier 12 month before (n = 19) and 18-month before time points (n = 18). The mean number of significantly upregulated genes identified in the subsampling was 3,485 (95% CI:2,719, 4,190). This was lower than the 4,391 identified with the full 90 samples but far higher than the number identified in the 18 (249; n = 19) and 12 (492; n = 18) months before diagnosis groups. (TIF)